Artificial Intelligence in Medicine
Elsevier BV
All preprints, ranked by how well they match Artificial Intelligence in Medicine's content profile, based on 15 papers previously published here. The average preprint has a 0.02% match score for this journal, so anything above that is already an above-average fit. Older preprints may already have been published elsewhere.
Patel, D.; Timsina, P.; Gorenstein, L.; Glicksberg, B. S.; Raut, G.; Cheetirala, S.; Santana, F.; Tamegue, J.; Kia, A.; Zimlichman, E.; Levin, M.; Freeman, R.; Klang, E.
Predicting hospitalization from nurse triage notes has significant implications in health informatics. To this end, we compared the performance of the deep-learning transformer-based model, bio-clinical-BERT, with a bag-of-words logistic regression model incorporating term frequency-inverse document frequency (BOW-LR-tf-idf). A retrospective analysis was conducted using data from 1,391,988 Emergency Department patients at the Mount Sinai Health System spanning 2017-2022. The models were trained on four hospitals' data and externally validated on a fifth. Bio-clinical-BERT achieved higher AUCs (0.82, 0.84, and 0.85) compared to BOW-LR-tf-idf (0.81, 0.83, and 0.84) across training sets of 10,000, 100,000, and ~1,000,000 patients, respectively. Notably, both models proved effective at utilizing triage notes for prediction, despite the modest performance gap. Importantly, our findings suggest that simpler machine learning models like BOW-LR-tf-idf could serve adequately in resource-limited settings. Given the potential implications for patient care and hospital resource management, further exploration of alternative models and techniques is warranted to enhance predictive performance in this critical domain.
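As a concrete illustration of the BOW-LR-tf-idf baseline described above, the sketch below fits a TF-IDF bag-of-words logistic regression on a few invented triage notes using scikit-learn. The notes, labels, and parameters are placeholders, not the study's data or code.

```python
# Minimal sketch of a BOW-LR-tf-idf baseline for triage-note classification.
# Toy notes and labels are illustrative stand-ins for the study's EHR data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

notes = [
    "chest pain radiating to left arm, diaphoretic",
    "mild ankle sprain after fall, ambulatory",
    "shortness of breath, history of CHF",
    "laceration to finger, bleeding controlled",
]
admitted = [1, 0, 1, 0]  # 1 = hospitalized, 0 = discharged

# Unigrams and bigrams weighted by term frequency-inverse document frequency
vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=1)
X = vectorizer.fit_transform(notes)

model = LogisticRegression(max_iter=1000)
model.fit(X, admitted)

# Probability of admission per note; the study computes AUC on an
# external hospital's data rather than the training set as done here.
probs = model.predict_proba(X)[:, 1]
auc = roc_auc_score(admitted, probs)
```

In the study's setting the same pipeline would be fit on millions of notes from four hospitals and scored on a held-out fifth hospital.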
Ferri, P.; Saez, C.; Felix-De Castro, A.; Sanchez-Cuesta, P.; Garcia-Gomez, J. M.
When developing Machine Learning models to support emergency medical triage, it is important to consider how changes in the data over time can negatively affect the models' performance. The objective of this study was to assess the effectiveness of novel Deep Continual Learning pipelines in maximizing model performance when input features are subject to change over time, including the emergence of new features and the disappearance of existing ones. The model is designed to identify life-threatening situations, predict the admissible response delay, and determine the institutional jurisdiction. We analyzed a total of 1,414,575 events spanning 2009 to 2019. Our findings demonstrate important improvements in absolute F1-score over the current triage protocol (up to 4.9% for life-threatening situations, 18.5% for response delay, and 1.7% for jurisdiction), and improvements of up to 4.4% for life-threatening situations and 11% for response delay with respect to non-continual approaches.
Di Noto, T.; Atat, C.; Teiga, E. G.; Hegi, M.; Hottinger, A.; Cuadra, M. B.; Hagmann, P.; Richiardi, J.
Natural Language Processing (NLP) on electronic health records (EHRs) can be used to monitor the evolution of pathologies over time to facilitate diagnosis and improve decision-making. In this study, we designed an NLP pipeline to classify Magnetic Resonance Imaging (MRI) radiology reports of patients with high-grade gliomas. Specifically, we aimed to distinguish reports indicating changes in tumors between one examination and the follow-up examination (treatment response/tumor progression versus stability). A total of 164 patients with 361 associated reports were retrieved from routine imaging, and reports were labeled by one radiologist. First, we assessed which embedding is more suitable when working with limited data, in French, from a specific domain. To do so, we compared a classic embedding technique, TF-IDF, to a neural embedding technique, Doc2Vec, after hyperparameter optimization for both. A random forest classifier was used to classify the reports into stable (unchanged tumor) or unstable (changed tumor). Second, we applied the post-hoc LIME explainability tool to understand the decisions taken by the model. Overall, classification results obtained in repeated 5-fold cross-validation with TF-IDF reached around 89% AUC and were significantly better than those achieved with Doc2Vec (Wilcoxon signed-rank test, P = 0.009). The explainability toolkit run on the TF-IDF model revealed some interesting patterns: words indicating change, such as "progression", were rightfully frequent for reports classified as unstable; similarly, words indicating no change, such as "not", were frequent for reports classified as stable. Lastly, the toolkit discovered misleading words, such as "T2", which are clearly not directly relevant for the task. All the code used for this study is made available.
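The embedding comparison above hinges on a Wilcoxon signed-rank test over paired per-fold AUCs, which SciPy provides directly. The fold AUCs below are invented placeholders, not the paper's numbers.

```python
# Sketch: comparing two embeddings' per-fold AUCs with a Wilcoxon
# signed-rank test, as done for TF-IDF vs. Doc2Vec. AUCs are invented.
from scipy.stats import wilcoxon

tfidf_aucs   = [0.901, 0.882, 0.893, 0.914, 0.875,
                0.906, 0.887, 0.898, 0.909, 0.890]
doc2vec_aucs = [0.850, 0.840, 0.830, 0.860, 0.820,
                0.845, 0.835, 0.855, 0.845, 0.825]

# Paired, non-parametric test on the per-fold differences
stat, p_value = wilcoxon(tfidf_aucs, doc2vec_aucs)
significant = p_value < 0.05  # here TF-IDF wins in every fold
```

Because the test pairs folds, it controls for fold-to-fold difficulty, which an unpaired comparison of mean AUCs would not.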
Sezgin, E.; Sirrianni, J.; Kranz, K.
Objective: We present a proof-of-concept digital scribe system as an ED clinical conversation summarization pipeline and report its performance. Materials and Methods: We use four pre-trained large language models to establish the digital scribe system: T5-small, T5-base, PEGASUS-PubMed, and BART-Large-CNN, via zero-shot and fine-tuning approaches. Our dataset includes 100 referral conversations among ED clinicians and medical records. We report ROUGE-1, ROUGE-2, and ROUGE-L to compare model performance. In addition, we annotated transcriptions to assess the quality of generated summaries. Results: The fine-tuned BART-Large-CNN model demonstrates the strongest summarization performance, with the highest ROUGE scores (ROUGE-1 F1 = 0.49, ROUGE-2 F1 = 0.23, ROUGE-L F1 = 0.35). In contrast, PEGASUS-PubMed lags notably (ROUGE-1 F1 = 0.28, ROUGE-2 F1 = 0.11, ROUGE-L F1 = 0.22). BART-Large-CNN's performance decreases by more than 50% with the zero-shot approach. Annotations show that BART-Large-CNN achieves 71.4% recall in identifying key information and a 67.7% accuracy rate. Discussion: The BART-Large-CNN model demonstrates a high level of understanding of clinical dialogue structure, indicated by its performance with and without fine-tuning. Despite some instances of high recall, there is variability in the model's performance, particularly in achieving consistent correctness, suggesting room for refinement. The model's recall varies across different information categories. Conclusion: The study provides evidence for the potential of AI-assisted tools to reduce clinical documentation burden. Future work should expand the research scope with larger language models and comparative analyses that measure documentation effort and time.
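ROUGE-1 F1, the headline metric above, is the harmonic mean of unigram precision and recall between a generated summary and a reference. The hand-rolled sketch below shows what the number measures; a real evaluation would use a maintained package such as rouge-score, whose tokenization and stemming differ.

```python
# Illustrative ROUGE-1 F1: unigram overlap between candidate and reference.
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped overlap: each unigram counts at most min(cand, ref) times
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

ROUGE-2 and ROUGE-L follow the same precision/recall pattern over bigrams and longest common subsequences, respectively.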
Cai, L.; Zhang, T.; Beets-Tan, R.; Brunekreef, J.; Teuwen, J.
The use of Electronic Health Records (EHRs) has increased significantly in recent years. However, a substantial portion of the clinical data remains in unstructured text formats, especially in the context of radiology. This limits the application of EHRs for automated analysis in oncology research. Pretrained language models have been utilized to extract feature embeddings from these reports for downstream clinical applications, such as treatment response and survival prediction. However, a thorough investigation into which pretrained models produce the most effective features for rectal cancer survival prediction has not yet been done. This study explores the performance of five Dutch pretrained language models, including two publicly available models (RobBERT and MedRoBERTa.nl) and three developed in-house for the purpose of this study (RecRoBERT, BRecRoBERT, and BRec2RoBERT), each trained on distinct Dutch-only corpora, in predicting overall survival and disease-free survival outcomes in rectal cancer patients. Our results showed that our in-house BRecRoBERT, a RoBERTa-based language model trained from scratch on a combination of Dutch breast and rectal cancer corpora, delivered the best predictive performance for both survival tasks, achieving a C-index of 0.65 (0.57, 0.73) for overall survival and 0.71 (0.64, 0.78) for disease-free survival. It outperformed models trained on general Dutch corpora (RobBERT) or Dutch hospital clinical notes (MedRoBERTa.nl). BRecRoBERT demonstrated the potential to predict survival in rectal cancer patients using Dutch radiology reports at diagnosis. This study highlights the value of pretrained language models that incorporate domain-specific knowledge for downstream clinical applications. Furthermore, it shows that utilizing data from related domains can improve the quality of feature embeddings for certain clinical tasks, particularly in situations where domain-specific data is scarce.
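The C-index reported above can be made concrete with a small, self-contained sketch for right-censored survival data. This pairwise version is illustrative only; it skips tied observed times and ignores refinements found in production implementations such as lifelines.

```python
# Illustrative concordance index (C-index) for right-censored survival data.
# A pair is usable when the subject with the shorter observed time had an
# event (not censoring); the pair is concordant when the model assigns that
# subject the higher risk. Tied risks earn half credit.
def c_index(times, events, risks):
    concordant, usable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(i + 1, n):
            # Order the pair so 'a' has the shorter observed time
            a, b = (i, j) if times[i] < times[j] else (j, i)
            if times[a] == times[b] or events[a] == 0:
                continue  # tied times or censored earlier subject: skip
            usable += 1
            if risks[a] > risks[b]:
                concordant += 1.0
            elif risks[a] == risks[b]:
                concordant += 0.5
    return concordant / usable
```

A C-index of 0.5 is chance-level ranking and 1.0 is perfect, so the 0.71 reported for disease-free survival means the model orders roughly seven of ten usable patient pairs correctly.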
Gao, Y.; Myers, S.; Chen, S.; Dligach, D.; Miller, T.; Bitterman, D. S.; Chen, G.; Mayampurath, A.; Churpek, M. M.; Afshar, M.
Large language models (LLMs) are being explored for diagnostic decision support, yet their ability to estimate pre-test probabilities, vital for clinical decision-making, remains limited. This study evaluates two LLMs, Mistral-7B and Llama3-70B, using structured electronic health record data on three diagnosis tasks. We examined three current methods of extracting LLM probability estimations and revealed their limitations. We aim to highlight the need for improved techniques in LLM confidence estimation.
Niset, A.; Melot, I.; Pireau, M.; Englebert, A.; Scius, N.; Flament, J.; El Hadwe, S.; Al Barajraji, M.; Thonon, H.; Barrit, S.
Background: Emergency departments face increasing pressure from staff shortages, patient surges, and administrative burdens. While large language models (LLMs) show promise in clinical support, their deployment in emergency medicine presents technical and regulatory challenges. Previous studies often relied on simplistic evaluations using public datasets, overlooking real-world complexities and data privacy concerns. Methods: At a tertiary emergency department, we retrieved 79 consecutive cases during a peak 24-hour period, constituting a siloed dataset. We evaluated six pipelines combining open- and closed-source embedding models (text-embedding-ada-002 and MXBAI) with foundational models (GPT-4, Llama3, and Qwen2), grounded through retrieval-augmented generation with emergency medicine textbooks. The models' top-five diagnostic predictions on early clinical data were compared against reference diagnoses established through expert consensus based on complete clinical data. Outcomes included diagnostic inclusion rate, ranking performance, and citation sourcing capabilities. Results: All pipelines showed comparable diagnostic inclusion rates (62.03-72.15%) without significant differences in pairwise comparisons. Case characteristics, rather than model combinations, significantly influenced predictive diagnostic performance. Cases with specific diagnoses were correctly diagnosed significantly more often than unspecific ones (85.53% vs. 31.41%, p<0.001), as were surgical versus medical cases (79.49% vs. 56.25%, p<0.001). Open-source foundational models demonstrated superior sourcing capabilities compared to GPT-4-based combinations (OR: 33.92 to ∞, p<1.4e-12), with MXBAI/Qwen2 achieving perfect sourcing. Conclusion: Open- and closed-source LLMs showed promising and comparable predictive diagnostic performance in a real-world emergency setting when evaluated on siloed data.
Case characteristics emerged as the primary determinant of performance, suggesting that current limitations reflect fundamental challenges of AI alignment in medical reasoning rather than model-specific constraints. Open-source models demonstrated superior sourcing capabilities, a critical advantage for interpretability. Continued research exploring larger-scale, multi-centric efforts, including real-time applications and human-computer interactions, as well as real-world clinical benchmarking and sourcing verification, will be key to delineating the full potential of grounded LLM-driven diagnostic assistance in emergency medicine.
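The study's primary outcome, diagnostic inclusion rate, reduces to a simple top-k membership check over each case's differential. The helper below is a hypothetical sketch with invented case data, not the study's siloed dataset or code.

```python
# Sketch: fraction of cases whose reference diagnosis appears anywhere in a
# pipeline's top-k differential. Predictions and references are invented.
def inclusion_rate(predictions, references, k=5):
    hits = sum(
        1
        for top, ref in zip(predictions, references)
        if ref.lower() in (d.lower() for d in top[:k])
    )
    return hits / len(references)

# Example: the first case's reference diagnosis is in the differential,
# the second's is not, so the rate is 1/2.
preds = [
    ["appendicitis", "cholecystitis", "gastritis"],
    ["migraine", "tension headache"],
]
refs = ["Appendicitis", "stroke"]
rate = inclusion_rate(preds, refs)
```

Ranking performance, the study's second outcome, would additionally weight where in the top five the reference diagnosis lands.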
Bernaola, N.; De Lima, G.; Riano, M.; Llanos, L.; Heili-Frades, S.; Sanchez, O.; Lara, A.; Plaza, G.; Carballo, C.; Gallego, P.; Larranaga, P.; Bielza, C.
Objectives: To present a model that enhances the accuracy of clinicians when presented with a possibly critical COVID-19 patient. Methods: A retrospective study was performed with information on 5,745 SARS-CoV-2-infected patients admitted to the emergency rooms of 4 public hospitals in Madrid belonging to the Quiron Salud Health Group (QS) from March 2020 to February 2021. Demographics, clinical variables on admission, laboratory markers, and therapeutic interventions were extracted from electronic clinical records. Traits related to mortality were found through difference-in-means testing and through feature selection by learning multiple classification trees with random initialization and selecting the variables that were used the most. We validated the model through cross-validation and tested generalization with an external dataset from 4 hospitals belonging to the Sanitas Hospitals Health Group. The usefulness of two different models in real cases was tested by measuring the effect of exposure to the model's decision on the accuracy of medical professionals. Results: Of the 5,745 admitted patients, 1,173 died. Of the 110 variables in the dataset, 34 were found to be related to our definition of criticality (death in <72 hours) or to all-cause mortality. The models had an accuracy of 85% and a sensitivity of 50%, averaged through 5-fold cross-validation. Similar results were found when validating with data from the 4 Sanitas hospitals. The models were found to have 11% better accuracy than doctors at classifying critical cases and improved doctors' accuracy by 12% for non-critical patients, reducing the cost of mistakes made by 17%.
Miao, B. Y.; Rodriguez Almaraz, E.; Ashraf Ganjouei, A.; Suresh, A.; Zack, T.; Bravo, M.; Raghavendran, S.; Oskotsky, B.; Alaa, A.; Butte, A. J.
Background: Molecular biomarkers play a pivotal role in the diagnosis and treatment of oncologic diseases, but staying updated with the latest guidelines and research can be challenging for healthcare professionals and patients. Large language models (LLMs), such as MedPalm-2 and GPT-4, have emerged as potential tools to streamline biomedical information extraction, but their ability to summarize molecular biomarkers for oncologic disease subtyping remains unclear. Auto-generation of clinical nomograms from text guidelines could illustrate a new type of utility for LLMs. Methods: In this cross-sectional study, two LLMs, GPT-4 and Claude-2, were assessed for their ability to generate decision trees for molecular subtyping of oncologic diseases with and without expert-curated guidelines. Clinical evaluators assessed the accuracy of biomarker and cancer subtype generation, as well as the validity of molecular subtyping decision trees, across five cancer types: colorectal cancer, invasive ductal carcinoma, acute myeloid leukemia, diffuse large B-cell lymphoma, and diffuse glioma. Results: Both GPT-4 and Claude-2 "off the shelf" successfully produced clinical decision trees that contained valid instances of biomarkers and disease subtypes. Overall, GPT-4 and Claude-2 showed limited improvement in the accuracy of decision tree generation when guideline text was added. A Streamlit dashboard was developed for interactive exploration of subtyping trees generated for other oncologic diseases. Conclusion: This study demonstrates the potential of LLMs like GPT-4 and Claude-2 in aiding the summarization of molecular diagnostic guidelines in oncology. While effective in certain aspects, their performance highlights the need for careful interpretation, especially in zero-shot settings. Future research should focus on enhancing these models for more nuanced and probabilistic interpretations in clinical decision-making.
The developed tools and methodologies present a promising avenue for expanding LLM applications in various medical specialties. Key points:
- Large language models, such as GPT-4 and Claude-2, can generate clinical decision trees that summarize best-practice guidelines in oncology.
- Providing guidelines in the prompt query improves the accuracy of oncology biomarker and cancer subtype information extraction.
- However, providing guidelines in zero-shot settings does not significantly improve generation of clinical decision trees for either GPT-4 or Claude-2.
Benani, A.; Ohayon, S.; Laleye, F.; Bauvin, P.; Messas, E.; Bodard, S.; Tannier, X.
Machine learning has demonstrated success in clinical decision-making, yet the added value of multimodal approaches over unimodal models remains unclear. This systematic review evaluates studies comparing multimodal and unimodal ML algorithms for diagnosis, prognosis, or prescription. A comprehensive search of MEDLINE up to January 2025 identified 97 studies across 12 medical specialties, with oncology being the most represented. The most common data fusion involved tabular data and images (67%). A risk-of-bias assessment using PROBAST revealed that 57% of studies had a low risk of bias, while 41% had a high risk. Multimodality outperformed unimodality in 91% of cases. No correlation between dataset sample size and added performance was observed. However, considerable methodological heterogeneity and potential publication bias warrant caution in interpretation. Further research is needed to refine evaluation metrics and hybrid model architectures based on specific clinical tasks. MeSH terms: Humans, Machine Learning, Clinical Decision-Making, Systematic Review.
Ishaque, A. H.; Boutet, A.; Hiremath, S. B.; Mullarkey, M. P.; Peris-Celda, M.; Zadeh, G.
Purpose: Large language models (LLMs) have demonstrated advanced capabilities in interpreting text and visual inputs. Their potential to transform oncological practice is significant, but their accuracy and reliability in interpreting medical imaging and offering management suggestions remain underexplored. This study aimed to evaluate the performance of ChatGPT in interpreting T1-weighted contrast-enhanced MRI images of meningiomas and glioblastomas and providing treatment recommendations based on simulated patient inquiries. Methods: This observational cohort study utilized publicly available MRI datasets. Thirty cases of meningiomas and glioblastomas were randomly selected, yielding 90 images (three orthogonal planes per case). ChatGPT-4o was tasked with interpreting these images and responding to six standardized patient-simulated questions. Two neuroradiologists and neurosurgeons assessed ChatGPT's performance using five-point Likert scales, and their inter-rater agreement was evaluated. Results: ChatGPT identified MRI sequences with 91.7% accuracy and localized tumors correctly in 66.7% of cases. Tumor size was qualitatively described in 85% of cases, and the median acceptability was rated 4.0 (IQR 4.0-5.0) by neuroradiologists. ChatGPT included meningioma in the differential diagnosis for 73.3% of meningioma cases and glioma for 83.3% of glioblastoma cases. Inter-rater agreement among neuroradiologists ranged from moderate to good (κ = 0.45-0.72). While surgical treatment was suggested in all symptomatic cases, neurosurgeon acceptability ratings varied, with poor inter-rater reliability. Conclusions: ChatGPT demonstrates potential in interpreting neuro-oncological MRI images and offering preliminary management recommendations. However, errors in tumor localization and variability in recommendation acceptability underscore the need for physician oversight and further refinement of LLMs before clinical integration.
Dwivedi, K.; Mahbod, A.; Ecker, R. C.; Janjic, K.
Oral squamous cell carcinoma (OSCC) accounts for a major part of cancer mortality, with survival outcomes highly dependent on early diagnosis. While many approaches have been proposed for OSCC survival prediction, they often rely on unimodal data, which may be suboptimal. In this study, we introduced a unified cross-attention-based deep learning framework that integrates whole-slide histopathology images (WSIs) and transcriptomic data from OSCC patients for survival prediction. The framework employed an autoencoder for transcriptomic feature extraction and a state-of-the-art pathology foundation model (evaluated across five alternatives) to derive WSI embeddings. These embeddings were subsequently integrated using cross-attention and concatenation within a Cox proportional hazards model. The multimodal approach outperformed nearly all unimodal counterparts, achieving a maximum concordance index of 0.780±0.059 with cross-attention and 0.766±0.050 with concatenation. The results indicate that pathotranscriptomic integration could improve survival prediction for OSCC patients. The implementation is available on GitHub at: https://github.com/kountaydwivedi/multimodal fusion.git. This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.
Chen, L.-C.; Zack, T.; Demirci, A.; Sushil, M.; Miao, B.; Kasap, C.; Butte, A. J.; Collisson, E.; Hong, J.
Purpose: We examined the effectiveness of proprietary and open large language models (LLMs) in detecting disease presence, location, and treatment response in pancreatic cancer from radiology reports. Methods: We analyzed 203 deidentified radiology reports, manually annotated for disease status, location, and indeterminate nodules needing follow-up. Utilizing GPT-4, GPT-3.5-turbo, and open models such as Gemma-7B and Llama3-8B, we employed strategies such as ablation and prompt engineering to boost accuracy. Discrepancies between human and model interpretations were reviewed by a secondary oncologist. Results: Among 164 pancreatic adenocarcinoma patients, GPT-4 showed the highest accuracy in inferring disease status, achieving 75.5% correctness (micro-F1). Open models Mistral-7B and Llama3-8B performed comparably, with accuracies of 68.6% and 61.4%, respectively. Mistral-7B excelled at deriving correct inferences directly from "Objective Findings". Most tested models demonstrated proficiency in identifying disease-containing anatomical locations from a list of choices, with GPT-4 and Llama3-8B showing near parity in precision and recall for disease site identification. However, open models struggled to differentiate benign from malignant post-surgical changes, impacting their precision in identifying findings indeterminate for cancer. A secondary review occasionally favored GPT-3.5's interpretations, indicating the variability in human judgment. Conclusion: LLMs, especially GPT-4, are proficient at deriving oncological insights from radiology reports. Their performance is enhanced by effective summarization strategies, demonstrating their potential in clinical support and healthcare analytics. This study also underscores the possibility of zero-shot open-model utility in environments where proprietary models are restricted. Finally, by providing a set of annotated radiology reports, this paper presents a valuable dataset for further LLM research in oncology.
Corso, F.; Peppoloni, V.; Mazzeo, L.; Leone, G.; Passos, L.; Miskovic, V.; Armanini, J.; Ferrarin, A.; Wiest, I. C.; Wolf, F.; Montelatici, G.; Romano', R.; Ambrosini, P.; Capoccia, T.; Natangelo, S.; Rota, S.; Andena, P.; De Ponti, M.; Russo, A.; Stasi, G.; Provenzano, L.; Spagnoletti, A.; Meazza Prina, M.; Cavalli, C.; Giani, C.; Serino, R.; Borraccino, M.; Bonalume, C.; Di Mauro, R. M.; Agosta, C.; Dumitrascu, A. D.; Di Liberti, G.; Corrao, G.; Beninato, T.; Ganzinelli, M.; Occhipinti, M.; Brambilla, M.; Proto, C.; Kather, J. N.; Pedrocchi, A. L. G.; De Braud, F.; Lo Russo, G.; Baili, P.; P
Real-world data (RWD), largely stored in unstructured electronic health records (EHRs), are critical for understanding complex diseases like cancer. However, extracting structured information from these narratives is challenging due to linguistic variability, semantic complexity, and privacy concerns. This study evaluates the performance of four locally deployable small language models (SLMs), LLaMA, Mistral, BioMistral, and MedLLaMA, for information extraction (IE) from Italian EHRs within the APOLLO 11 trial on non-small cell lung cancer (NSCLC). We examined three prompting strategies (zero-shot, few-shot, and annotated few-shot) across English and Italian, involving clinicians with varying expertise to assess prompt design's impact on accuracy. Results show that general-purpose models (e.g., LLaMA 3.1 8B) outperform biomedical models in most tasks, particularly in extracting binary features. Multiclass variables such as TNM staging, PD-L1, and ECOG were more difficult due to implicit language and lack of standardization. Few-shot prompting and native-language inputs significantly improved performance and reduced hallucinations. Clinical expertise enhanced consistency in annotation, particularly among students using annotated examples. The study confirms that privacy-preserving SLMs can be deployed locally for efficient and secure cancer data extraction. Findings highlight the need for hybrid systems combining SLMs with expert input and underline the importance of aligning clinical documentation practices with SLM capabilities. This is the first study to benchmark SLMs on Italian EHRs and investigate the role of clinical expertise in prompt engineering, offering valuable insights for the future integration of SLMs into real-world clinical workflows.
Sorin, V.; Glicksberg, B. S.; Barash, Y.; Konen, E.; Nadkarni, G.; Klang, E.
Purpose: Recently introduced large language models (LLMs) such as ChatGPT have already shown promising results in natural language processing in healthcare. The aim of this study is to systematically review the literature on applications of LLMs in breast cancer diagnosis and care. Methods: A literature search was conducted using MEDLINE, focusing on studies published up to October 22nd, 2023, using the following terms: "large language models", "LLM", "GPT", "ChatGPT", "OpenAI", and "breast". Results: Five studies met our inclusion criteria. All studies were published in 2023, focusing on ChatGPT-3.5 or GPT-4 by OpenAI. Applications included information extraction from clinical notes, question-answering based on guidelines, and patient management recommendations. The rate of correct answers varied from 64% to 98%, with the highest accuracy (88-98%) observed in information extraction and question-answering tasks. Notably, most studies utilized real patient data rather than data sourced from the internet. Limitations included inconsistent accuracy, prompt sensitivity, and overlooked clinical details, highlighting areas for cautious LLM integration into clinical practice. Conclusion: LLMs demonstrate promise in text analysis tasks related to breast cancer care, including information extraction and guideline-based question-answering. However, variations in accuracy and the occurrence of erroneous outputs necessitate validation and oversight. Future work should focus on improving the reliability of LLMs within clinical workflows.
Harchandani, S.; Quinn, R.; Mittal, K.; Martin, A.; Wang, M.-J.; Holstead, R. G.
The expanding capacity of large language models allows for improvements in patient and provider healthcare quality and experience. The medical oncology consultation often includes a discussion of a life-limiting diagnosis and complex treatment protocols. Patient recall from the discussion may be limited, and it is possible that a patient-specific written summary could help with understanding, recall, and overall experience. Using a privacy-compliant large language model, a prompt was instructed to rewrite an ambulatory medical consultation note as a patient-friendly summary, capturing key details from a diagnosis and treatment plan. The summary was provided to both provider and patient for review, and a 5-point Likert survey was administered inquiring about the output's accuracy, clarity, and helpfulness. Patients reported agreement of 100%, 100%, and 87% on each topic, respectively. 93% of patients recommended the use of similar summaries in the future. Providers reported agreement of 98%, 91%, and 96% for accuracy, clarity, and empathy, respectively. All providers (100%) recommended that similar summaries be used in the future. Some of the summaries retained jargon, and results from this study will be used to optimize the prompt for an expanded study. In conclusion, a patient-friendly summary derived from a medical note using a large language model prompt was helpful to patients and found to be useful by providers. Author Summary: As medical oncology providers, our new patient consultation appointments often require disclosing the diagnosis of a cancer and a discussion of prognosis, complex treatment plans, the potential for significant side effects, and a number of tests/procedures that are required prior to initiation of the care plan. Patients often benefit from friends or family who take notes during an appointment; however, this is not always possible.
Technological advances in natural language processing with large language models such as ChatGPT allow for translation of medical language into plain language. In this study, we used a prompt to rewrite a medical note into a summary of the patient's oncologic diagnosis and care plan. We then provided this summary to patients and providers to assess their feedback on the value of these summaries. We found that both providers and patients found these summaries to be accurate and understandable. Both groups recommended further development of these summaries. We intend to optimize our summary production for future studies using findings and feedback from this project.
Adamson, B. J.; Waskom, M.; Blarre, A.; Kelly, J.; Krismer, K.; Nemeth, S.; Gipetti, J.; Ritten, J.; Harrison, K.; Ho, G.; Linzmayer, R.; Bansal, T.; Wilkinson, S.; Amster, G.; Estola, E.; Benedum, C. M.; Fidyk, E.; Estevez, M.; Shapiro, W.; Cohen, A. B.
Background: As artificial intelligence (AI) continues to advance with breakthroughs in natural language processing (NLP) and machine learning (ML), such as the development of models like OpenAI's ChatGPT, new opportunities are emerging for efficient curation of electronic health records (EHRs) into real-world data (RWD) for evidence generation in oncology. Our objective is to describe the research and development of industry methods to promote transparency and explainability. Methods: We applied NLP with ML techniques to train, validate, and test the extraction of information from unstructured documents (e.g., clinician notes, radiology reports, lab reports) into a set of structured variables required for RWD analysis. This research used a nationwide EHR-derived database. Models were selected based on performance. Variables curated with an ML-extraction approach are those whose value is determined solely by an ML model (i.e., not confirmed by abstraction), which identifies key information from visit notes and documents. These models do not predict future events or infer missing information. Results: We developed an approach using NLP and ML for extraction of clinically meaningful information from unstructured EHR documents and found high performance of output variables compared with manually abstracted variables. These extraction methods produced research-ready variables including initial cancer diagnosis with date, advanced/metastatic diagnosis with date, disease stage, histology, smoking status, surgery status with date, biomarker test results with dates, and oral treatments with dates. Conclusions: NLP and ML enable the extraction of retrospective clinical data from EHRs with speed and scalability to help researchers learn from the experience of every person with cancer.
Khashei, I.; Presciani, D.; Martinelli, L. P.; Grosjean, S.
Retrieval-augmented generation (RAG) is increasingly adopted to ground clinical conversational agents in external knowledge sources, yet many deployed prototypes lack the observability required for standard RAG evaluation. In particular, retrieved documents and grounding context are often not logged, preventing direct assessment of retrieval quality and faithfulness. We report a post-hoc evaluation of EMSy, a clinical RAG-based chatbot prototype, based on 2,660 multi-turn conversations collected between January and September 2025. Rather than benchmarking performance, we adopt an evaluation strategy based exclusively on observable signals. The analysis combines an exploratory intent analysis conducted on a random subset of heterogeneous interactions, automated quality scores available at the message and conversation level, and explicit user feedback, with 96.0% of rated conversations receiving positive feedback. Results indicate that message-level minimum scores capture localized low-quality responses that are not reflected by average conversation-level metrics, while user feedback reflects aggregate interaction impressions. This case study illustrates how diagnostic insights can be obtained under limited observability and identifies implications for the design and evaluation of future clinical RAG systems.
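The finding that message-level minimum scores expose failures which conversation-level averages hide can be shown with a toy computation. The scores and the flagging threshold below are hypothetical, not EMSy's actual metrics:

```python
def conversation_diagnostics(message_scores, threshold=0.5):
    """Return (mean, minimum, flagged) for one conversation's per-message
    quality scores; flagged when any single message falls below threshold."""
    mean_score = sum(message_scores) / len(message_scores)
    min_score = min(message_scores)
    return mean_score, min_score, min_score < threshold

# A 6-turn conversation with one poorly grounded response (0.2): the mean
# still looks healthy, but the minimum flags the localized failure.
scores = [0.9, 0.85, 0.2, 0.95, 0.9, 0.8]
mean_s, min_s, flagged = conversation_diagnostics(scores)
print(f"mean={mean_s:.2f} min={min_s:.2f} flagged={flagged}")
# → mean=0.77 min=0.20 flagged=True
```

This is exactly the asymmetry the study reports: aggregate metrics track overall impressions, while the per-message minimum is the diagnostic signal for localized low-quality responses.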
Romagnoli, F.; Pellegrini, M.
Background: The ideal of personalized medicine is to support the clinical decision process towards the right drug for the right patient at the right time, by using, among other diagnostic tools, molecular biomarkers that depend specifically on the patient's status and on the therapeutic options. Several challenges must be overcome to realize this vision. Patients present a wide spectrum of genetic variability even before developing disease, and diseases like cancer add an extra layer of mutations, while only a very small fraction of such variants have diagnostic or prognostic value. Moreover, it is also challenging to predict how a patient will respond to a specific drug based on the patient's omic profile, since any drug introduces further perturbations into the biochemical model. Methods: In this paper we propose Personalized-DrugRank, a method for joint prediction of therapy response and time-to-response for cancer patients undergoing pharmacological therapy after surgery. The method personalizes the DrugMerge drug-repositioning methodology to extract a few synthetic indices useful as input to ML prediction tools. In particular, the proposed methodology is a novel and principled approach to merging independent patient-specific transcriptomic data with drug perturbation data from cell-line assays. A key novelty of our approach over the state of the art is the joint prediction of the patient's response to therapy along with an estimate of the time-to-response (i.e., the predicted time needed for the therapy to succeed or fail). Findings: We tested our methodology on data from The Cancer Genome Atlas (TCGA) Program for three cancer types (breast, stomach, and colorectal cancer), 10 pharmacological regimens, and 13 homogeneous cohorts.
For the therapy-response prediction task, we developed models that attain an average AUC of 0.749, an average p-value of 0.030, and an average accuracy of 0.809 with balanced positive and negative predictive values. For the time-to-event prediction task, we developed regression models for the 13 homogeneous cohorts that attain an average (geometric) concordance index of 0.782 (max 0.904, min 0.651) with an average log-likelihood p-value of 0.004, improving in nine of the 13 cohorts upon models based only on clinical parameters, which have an average concordance index of 0.678 and an average p-value of 0.006. Interestingly, we attain statistically significant results even with quite small therapy-homogeneous cohorts (ranging from 7 to 32 patients). Conclusions: The ability to predict with high accuracy the response of a cancer patient to a chosen pharmacological regimen, along with an estimate of the time-to-response, helps adapt the clinical decision process to the specific patient profile, thus increasing the likelihood of correct and timely therapeutic decisions.
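The concordance index used to evaluate the time-to-response models measures how often the model ranks pairs of patients in the correct temporal order. A minimal sketch for the fully observed (uncensored) case, with invented times and risk scores, not the study's data:

```python
from itertools import combinations

def concordance_index(event_times, predicted_risk):
    """Harrell's C-index for fully observed (uncensored) event times.

    A pair is concordant when the subject with the shorter time-to-event
    was assigned the higher predicted risk; tied risks count as 0.5.
    """
    concordant, total = 0.0, 0
    for i, j in combinations(range(len(event_times)), 2):
        if event_times[i] == event_times[j]:
            continue  # tied event times are not comparable here
        total += 1
        shorter, longer = (i, j) if event_times[i] < event_times[j] else (j, i)
        if predicted_risk[shorter] > predicted_risk[longer]:
            concordant += 1.0
        elif predicted_risk[shorter] == predicted_risk[longer]:
            concordant += 0.5
    return concordant / total

times = [5, 8, 12, 20]        # hypothetical times-to-response
risk  = [0.9, 0.7, 0.8, 0.2]  # hypothetical model risk scores
print(concordance_index(times, risk))  # 5 of 6 comparable pairs concordant ≈ 0.833
```

A C-index of 0.5 corresponds to random ranking and 1.0 to perfect ordering, which is why the reported 0.782 versus 0.678 for clinical-only models is a meaningful gap. Survival analysis in practice also handles censored observations, which this sketch omits.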
Pham, T. D.; Marks, K.; Hughes, D.; Chatzopoulou, D.; Coulthard, P.; Holmes, S.
Head injuries are a leading global cause of mortality and disability, highlighting the critical need for advanced prognostic tools to inform clinical decision-making and optimize healthcare resource utilization. For the first time, this study introduces a cutting-edge artificial intelligence (AI) framework designed to predict mortality outcomes from head injury narratives. Leveraging deep learning-based natural language processing techniques, the framework identifies and extracts key features from unstructured text describing injury mechanisms and patient conditions to train predictive models. Validation was conducted on a diverse dataset of 1,500 head injury cases using a stratified holdout approach, with 90% allocated for training and 10% for testing. The one-dimensional convolutional neural network model demonstrated strong performance, achieving on average 85% accuracy, 74% correct mortality prediction, 88% correct survival prediction, and an area under the receiver operating characteristic curve of 0.91. This work highlights the transformative potential of AI in harnessing narrative clinical data to enhance prognostic accuracy, paving the way for more effective, evidence-based management of head injury patients.
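The three headline figures correspond to overall accuracy plus the per-class correct-prediction rates (class-wise recall) for deaths and survivals. A minimal sketch of how they are computed, with made-up labels rather than the study's data:

```python
def class_metrics(y_true, y_pred):
    """Overall accuracy plus per-class recall for binary mortality labels
    (1 = died, 0 = survived). The label lists below are illustrative only."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    mortality_recall = tp / y_true.count(1)  # "correct mortality prediction"
    survival_recall = tn / y_true.count(0)   # "correct survival prediction"
    return accuracy, mortality_recall, survival_recall

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]  # invented outcomes
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]  # invented model predictions
acc, mort, surv = class_metrics(y_true, y_pred)
print(f"accuracy={acc:.2f} mortality_recall={mort:.2f} survival_recall={surv:.2f}")
```

Reporting both per-class rates matters here because mortality is typically the rarer class, so overall accuracy alone would overstate performance on the deaths the model is meant to catch.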